Smart Paper Notes

Active Learning with Safety Constraints

Romain Camilleri , Andrew Wagenmaker , Jamie Morgenstern , Lalit Jain , Kevin Jamieson

Categories: Machine Learning | Machine Learning (Statistics)

2022-06-22

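Active learning methods have shown great promise in reducing the number of samples needed for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work, we investigate the complexity of learning the best safe decision in interactive environments. We reduce this problem to a constrained linear bandit problem, where our goal is to find the best arm that satisfies certain (unknown) safety constraints. We propose an algorithm based on adaptive experimental design, which we show efficiently trades off the difficulty of showing that an arm is unsafe against the difficulty of showing that it is suboptimal. To our knowledge, our results are the first on best-arm identification in linear bandits with safety constraints. In practice, we demonstrate that this approach performs well on synthetic and real-world datasets.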

Translated by Google Translate

Nearly Optimal Algorithms for Level Set Estimation

Blake Mason , Romain Camilleri , Subhojyoti Mukherjee , Kevin Jamieson , Robert Nowak , Lalit Jain

Categories: Machine Learning (Statistics) | Machine Learning

2021-11-02

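The level set estimation problem seeks to find all points in a domain $\mathcal{X}$ where the value of an unknown function $f: \mathcal{X} \rightarrow \mathbb{R}$ exceeds a threshold $\alpha$. The estimation is based on noisy function evaluations that may be acquired at sequentially and adaptively chosen locations in $\mathcal{X}$. The threshold $\alpha$ can either be explicit and provided a priori, or implicit and defined relative to the optimal function value, i.e., $\alpha = (1 - \epsilon) f(x_\ast)$ for a given $\epsilon > 0$, where $f(x_\ast)$ is the maximal function value and is unknown. In this work, we provide a new approach to the level set estimation problem by relating it to recent adaptive experimental design methods for linear bandits in the Reproducing Kernel Hilbert Space (RKHS) setting. We assume that $f$ can be approximated by a function in the RKHS up to an unknown misspecification, and we provide novel algorithms for both the implicit and explicit cases in this setting with strong theoretical guarantees. Moreover, in the linear (kernel) setting, we show that our bounds are nearly optimal, namely, our upper bounds match existing lower bounds for threshold linear bandits. To our knowledge, this work provides the first instance-dependent, non-asymptotic upper bounds on the sample complexity of level set estimation that match information-theoretic lower bounds.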
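The two threshold conventions in the abstract above can be illustrated with a minimal sketch. It assumes a finite domain whose function values are known exactly (no noise and no adaptive sampling), so it only shows what the target set is, not how the paper's algorithms estimate it. The names `level_set`, `implicit_threshold`, and the example values are hypothetical.

```python
def level_set(values, alpha):
    """Return the indices whose function value exceeds the threshold alpha."""
    return {i for i, v in enumerate(values) if v > alpha}


def implicit_threshold(values, epsilon):
    """Threshold relative to the maximum: alpha = (1 - epsilon) * f(x_*)."""
    return (1.0 - epsilon) * max(values)


# Hypothetical noise-free values of f on a four-point domain.
f_values = [1.0, 0.5, 0.95, 0.2]

# Explicit case: alpha is given a priori.
explicit = level_set(f_values, alpha=0.4)

# Implicit case: alpha depends on the (normally unknown) maximum of f.
implicit = level_set(f_values, implicit_threshold(f_values, epsilon=0.1))
```

In the implicit case the threshold itself has to be learned, since it depends on the unknown maximum $f(x_\ast)$.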


Selective Sampling for Online Best-arm Identification

Romain Camilleri , Zhihan Xiong , Maryam Fazel , Lalit Jain , Kevin Jamieson

Categories: Machine Learning

2021-10-28

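This work considers the problem of selective sampling for best-arm identification. Given a set of potential options $\mathcal{Z} \subset \mathbb{R}^d$, a learner aims to compute, with probability greater than $1 - \delta$, $\arg\max_{z \in \mathcal{Z}} z^{\top} \theta_{\ast}$, where $\theta_{\ast}$ is unknown. At each time step, a potential measurement $x_t \in \mathcal{X} \subset \mathbb{R}^d$ is drawn IID, and the learner can either choose to take the measurement, in which case they observe a noisy measurement of $x^{\top} \theta_{\ast}$, or abstain from taking the measurement and wait for a potentially more informative point to arrive in the stream. Hence the learner faces a fundamental trade-off between the number of labeled samples they take and the time at which they have collected enough evidence to declare the best arm and stop sampling. The main results of this work precisely characterize this trade-off between labeled samples and stopping time and provide an algorithm that nearly optimally achieves the minimal label complexity given a desired stopping time. In addition, we show that the optimal decision rule has a simple geometric form based on deciding whether a point is in an ellipse or not. Finally, our framework is general enough to capture binary classification, improving upon previous works.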
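The geometric form of the decision rule mentioned above can be sketched as a membership test for an ellipsoid $\{x : x^{\top} M x \le 1\}$. This is only an illustration under assumptions: the abstract does not say how the matrix is chosen or which side of the ellipse triggers a measurement, so the matrix `M`, the unit radius, and the function names below are hypothetical.

```python
def quadratic_form(x, M):
    """Compute x^T M x for a vector x and a square matrix M given as nested lists."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def in_ellipsoid(x, M, radius=1.0):
    """Return True if x lies inside the ellipsoid {x : x^T M x <= radius}."""
    return quadratic_form(x, M) <= radius


# Hypothetical positive-definite matrix defining the ellipse x1^2 + 4 * x2^2 <= 1.
M = [[1.0, 0.0],
     [0.0, 4.0]]

# Hypothetical stream of candidate measurements, split by the membership test.
stream = [[0.5, 0.1], [0.0, 0.6], [0.9, 0.0]]
inside = [x for x in stream if in_ellipsoid(x, M)]
outside = [x for x in stream if not in_ellipsoid(x, M)]
```

A rule of this shape needs only one quadratic form per arriving point, which keeps the per-step decision cheap.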


Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning

Julien Denize , Jaonary Rabarisoa , Astrid Orcesi , Romain Hérault

Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-12-21

Contrastive representation learning has proven to be an effective self-supervised learning method for images and videos. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that should be contrasted with other instances, called negatives, that are considered as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should contain relations between the instances, or semantic similarity and dissimilarity, which contrastive learning harms by considering all negatives as noise. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances, called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances based on their learned similarities. We empirically validate our approach on both image and video representation learning. We show that SCE performs competitively with the state of the art on the ImageNet linear evaluation protocol for fewer pretraining epochs and that it generalizes to several downstream image tasks. We also show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.


Hidden-Variables Genetic Algorithm for Variable-Size Design Space Optimal Layout Problems with Application to Aerospace Vehicles

Juliette Gamot , Mathieu Balesdent , Arnault Tremolet , Romain Wuilbercq , Nouredine Melab , El-Ghazali Talbi

Categories: Artificial Intelligence

2022-12-21

The optimal layout of a complex system such as an aerospace vehicle consists of placing a given number of components in a container so as to minimize one or several objectives under geometrical or functional constraints. This paper presents an extended formulation of this problem as a variable-size design space (VSDS) problem to take into account a large number of architectural choices and component allocations during the design process. As a representative example of such systems, consider the layout of a satellite module: the VSDS aspect reflects the fact that the optimizer has to choose between several subdivisions of the components. For instance, one large fuel tank might be placed, or two smaller tanks, or three even smaller tanks, for the same amount of fuel. To tackle this NP-hard problem, a genetic algorithm enhanced by an adapted hidden-variables mechanism is proposed. The mechanism is illustrated on a toy case and on an aerospace application case representative of real-world complexity to show the performance of the proposed algorithms. The results obtained using the proposed mechanism are reported and analyzed.


JEMMA: An Extensible Java Dataset for ML4Code Applications

Anjan Karmakar , Miltiadis Allamanis , Romain Robbes

Categories: Machine Learning

2022-12-18

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.


Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti , Romain Beaumont , Ross Wightman , Mitchell Wortsman , Gabriel Ilharco , Cade Gordon , Christoph Schuhmann , Ludwig Schmidt , Jenia Jitsev

Categories: Machine Learning | Artificial Intelligence | Computer Vision

2022-12-14

Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data & models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip
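As a rough illustration of what identifying power law scaling can involve, the sketch below fits $y = a \cdot x^{b}$ by ordinary least squares in log-log space. It is a generic technique, not the evaluation workflow from the repository above, and the data points are synthetic values generated from a known power law rather than results from the paper. The function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least-squares linear regression on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                         # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)     # intercept in log-log space = log(a)
    return a, b


# Synthetic data generated from y = 2.0 * x**-0.5 (NOT measurements from the paper).
scales = [1e6, 1e7, 1e8, 1e9]
errors = [2.0 * s ** -0.5 for s in scales]

a, b = fit_power_law(scales, errors)
```

On noise-free synthetic data the fit recovers the generating coefficient and exponent up to floating-point error; real measurements would scatter around the fitted line.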


Domain Translation via Latent Space Mapping

Tsiry Mayet , Simon Bernard , Clement Chatelain , Romain Herault

Categories: Machine Learning | Computer Vision

2022-12-06

In this paper, we investigate the problem of multi-domain translation: given an element $a$ of domain $A$, we would like to generate a corresponding sample $b$ in another domain $B$, and vice versa. Because acquiring supervision in multiple domains can be a tedious task, we propose to learn this translation from one domain to another when supervision is available as a pair $(a,b)\sim A\times B$, while leveraging possible unpaired data when only $a\sim A$ or only $b\sim B$ is available. We introduce a new unified framework called Latent Space Mapping that exploits the manifold assumption in order to learn, from each domain, a latent space. Unlike existing approaches, we propose to further regularize each latent space using available domains by learning each dependency between pairs of domains. We evaluate our approach on three tasks: i) image translation on a synthetic dataset, ii) a real-world task of semantic segmentation for medical images, and iii) a real-world task of facial landmark detection.


Codex Hacks HackerRank: Memorization Issues and a Framework for Code Synthesis Evaluation

Anjan Karmakar , Julian Aron Prenner , Marco D'Ambros , Romain Robbes

Categories: Machine Learning

2022-12-06

The Codex model has demonstrated extraordinary competence in synthesizing code from natural language problem descriptions. However, in order to reveal unknown failure modes and hidden biases, such large-scale models must be systematically subjected to multiple and diverse evaluation studies. In this work, we evaluate the code synthesis capabilities of the Codex model based on a set of 115 Python problem statements from a popular competitive programming portal: HackerRank. Our evaluation shows that Codex is indeed proficient in Python, solving 96% of the problems in a zero-shot setting, and 100% of the problems in a few-shot setting. However, Codex exhibits clear signs of generating memorized code based on our evaluation. This is alarming, especially since the adoption and use of such models could directly impact how code is written and produced in the foreseeable future. With this in mind, we further discuss and highlight some of the prominent risks associated with large-scale models of source code. Finally, we propose a framework for code-synthesis evaluation using variations of problem statements based on mutations.


Misinformation Detection using Persuasive Writing Strategies

Joseph Romain , Huiyi Liu , Wei Peng , Jingbo Meng , Parisa Kordjamshidi

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-11-11

The spread of misinformation is a prominent problem in today's society, and many researchers in academia and industry are trying to combat it. Due to the vast amount of misinformation that is created every day, it is unrealistic to leave this task to human fact-checkers. Data scientists and researchers have been working on automated misinformation detection for years, and it is still a challenging problem today. The goal of our research is to add a new level to automated misinformation detection: classifying segments of text with persuasive writing techniques in order to produce interpretable reasoning for why an article can be marked as misinformation. To accomplish this, we present a novel annotation scheme containing many common persuasive writing tactics, along with a dataset annotated by humans according to this scheme. For this task, we make use of a RoBERTa model for text classification, due to its high performance in NLP. We develop several language model-based baselines and present the results of our persuasive strategy label predictions as well as the improvements these intermediate labels make in detecting misinformation and producing interpretable results.
